## Loading required package: lattice
## [1] 10886 16
## [1] "datetime" "season" "holiday" "workingday" "weather"
## [6] "temp" "atemp" "humidity" "windspeed" "casual"
## [11] "registered" "count" "hour" "month" "year"
## [16] "yearmonth"
## 'data.frame': 10886 obs. of 16 variables:
## $ datetime : POSIXct, format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
## $ season : Factor w/ 4 levels "Spring","Summer",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : Factor w/ 4 levels "Clear","Mist",..: 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : int 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num 0 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ count : int 16 40 32 13 1 1 2 3 8 14 ...
## $ hour : int 0 1 2 3 4 5 6 7 8 9 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ yearmonth : int 201101 201101 201101 201101 201101 201101 201101 201101 201101 201101 ...
## [1] "Clear" "Mist" "Rain" "Heavy Rain"
## [1] "Spring" "Summer" "Fall" "Winter"
## datetime season holiday
## Min. :2011-01-01 00:00:00 Spring:2686 Min. :0.00000
## 1st Qu.:2011-07-02 07:15:00 Summer:2733 1st Qu.:0.00000
## Median :2012-01-01 20:30:00 Fall :2733 Median :0.00000
## Mean :2011-12-27 05:18:05 Winter:2734 Mean :0.02857
## 3rd Qu.:2012-07-01 12:45:00 3rd Qu.:0.00000
## Max. :2012-12-19 23:00:00 Max. :1.00000
## workingday weather temp atemp
## Min. :0.0000 Clear :7192 Min. : 0.82 Min. : 0.76
## 1st Qu.:0.0000 Mist :2834 1st Qu.:13.94 1st Qu.:16.66
## Median :1.0000 Rain : 859 Median :20.50 Median :24.24
## Mean :0.6809 Heavy Rain: 1 Mean :20.23 Mean :23.66
## 3rd Qu.:1.0000 3rd Qu.:26.24 3rd Qu.:31.06
## Max. :1.0000 Max. :41.00 Max. :45.45
## humidity windspeed casual registered
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 47.00 1st Qu.: 7.002 1st Qu.: 4.00 1st Qu.: 36.0
## Median : 62.00 Median :12.998 Median : 17.00 Median :118.0
## Mean : 61.89 Mean :12.799 Mean : 36.02 Mean :155.6
## 3rd Qu.: 77.00 3rd Qu.:16.998 3rd Qu.: 49.00 3rd Qu.:222.0
## Max. :100.00 Max. :56.997 Max. :367.00 Max. :886.0
## count hour month year
## Min. : 1.0 Min. : 0.00 Min. : 1.000 Min. :2011
## 1st Qu.: 42.0 1st Qu.: 6.00 1st Qu.: 4.000 1st Qu.:2011
## Median :145.0 Median :12.00 Median : 7.000 Median :2012
## Mean :191.6 Mean :11.54 Mean : 6.521 Mean :2012
## 3rd Qu.:284.0 3rd Qu.:18.00 3rd Qu.:10.000 3rd Qu.:2012
## Max. :977.0 Max. :23.00 Max. :12.000 Max. :2012
## yearmonth
## Min. :201101
## 1st Qu.:201107
## Median :201201
## Mean :201157
## 3rd Qu.:201207
## Max. :201212
## [1] "count: 181.14"
## [1] "registered: 151.04"
## [1] "casual: 49.96"
## [1] "windspeed: 8.16"
## [1] "humidity: 19.25"
## [1] "atemp: 8.47"
## [1] "temp: 7.79"
The date range runs from Jan 01, 2011 12am to Dec 19, 2012 11pm, and given that 2012 was a leap year we should have 17,256 hourly entries between the two dates (719 days × 24). However, we have only 10,886 entries, which means either that this is a sample of the timeframe (likely, as the file is called train.csv) or that hours with no riders were excluded. We can test for that possibility…
sum(bikeShare$count == 0)
## [1] 0
So either hours with no riders aren’t recorded, or there are always riders. Either way, we certainly don’t have data for every hour between our selected dates; in fact we only have 63.09% of the hours covered.
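The hour arithmetic can be checked directly in base R; the 10,886 figure comes from the dim() output at the top of this section:

```r
# Count the possible hourly stamps between the first and last datetime.
start <- as.POSIXct("2011-01-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2012-12-19 23:00:00", tz = "UTC")
all_hours <- seq(start, end, by = "hour")
length(all_hours)                            # 17256 possible hours
round(100 * 10886 / length(all_hours), 2)    # 63.09 percent coverage
```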
Since there is a single entry for Heavy Rain in the weather column, we will either have to drop that entry or combine it with Rain if we are to use that variable.
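One hedged way to do the combination, sketched on a toy factor (in the report the same recode would apply to `bikeShare$weather`):

```r
# Merge the "Heavy Rain" level into "Rain" by renaming the factor level.
w <- factor(c("Clear", "Mist", "Rain", "Heavy Rain"),
            levels = c("Clear", "Mist", "Rain", "Heavy Rain"))
levels(w)[levels(w) == "Heavy Rain"] <- "Rain"
table(w)   # the lone Heavy Rain entry is now counted under Rain
```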
## Spring Summer Fall Winter
## 2686 2733 2733 2734
It’s interesting how closely matched the seasons are. If these were just the entries that contained riders, rather than a sample from the timeframe, you wouldn’t expect the numbers to be so closely in line with one another. Then again, even as a sample they must have drawn from each season to produce these results.
As mentioned earlier, all of the entries contain at least one rider and each entry represents one hour. Median ridership is 145 with a maximum number of riders of 977.
At first glance it looks like the environmental variables (temp, atemp, humidity and windspeed) are close to a normal distribution, whereas all the ridership variables (casual, registered and count) are heavily right-skewed. Once we start plotting we’ll see this better.
After transforming the long tailed data to get a better understanding of the ridership…
The transformed ridership appears unimodal with a rise as we move past 50 to the peak at around 175 riders after which we have a steady decline.
There are really three sections to this dataset, Time, Weather and Ridership. Since we’ve just reviewed ridership let’s take a glance at the other two.
The spread across the dates seems fairly constant over the period of 719 days.
After overlaying the distribution line on the temperatures, we concluded that temperature and date don’t tell us much on their own, since the distribution looks like the parametric shape we would expect.
Which I guess is something - though not exactly interesting.
It looks like the weather is pretty good in DC. This will skew our results with so many clear periods in relation to others when we compare the impact of weather to ridership. In other words, if we state that there are more riders during clear hours we need to account for the fact that there are that many more clear hours as well. We’ll use averages when dealing with weather in this case.
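A minimal sketch of what "using averages" means here, on toy data (in the report the same aggregation would run over the full dataset):

```r
# Mean riders per weather level: comparable across levels even though
# clear hours vastly outnumber the others.
toy <- data.frame(weather = c("Clear", "Clear", "Clear", "Mist"),
                  count   = c(100, 300, 200, 150))
tapply(toy$count, toy$weather, mean)
```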
note: the ggpairs output took a very long time to run on this dataset of over 10k rows. I included an export of the plot, labeled ‘ggpairs.png’, as part of this package.
Let’s see how the ridership stacks up by month.
Though there are more cyclists in the summer months as you might expect, the increase from January to December is likely a reflection of growth in the company as a whole over the two year period we are looking at.
Let’s break that out over the two year mark to see if that’s the case.
There was an increase in ridership; let’s quantify that a bit.
Total ridership increased by 66.69% from 2011 through 2012, confirming the suggestion that growth rates contributed to the difference in our boxplots.
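The growth figure is a plain percentage change; a hypothetical helper makes the calculation explicit (in the report the yearly totals would come from summing count by year):

```r
# Percentage change between two yearly totals.
pct_growth <- function(prev, curr) 100 * (curr - prev) / prev

# In the report, roughly:
# pct_growth(sum(count[year == 2011]), sum(count[year == 2012]))
pct_growth(100, 166.69)   # 66.69
```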
Here we normalized the ridership using the median number of riders as precipitation increases. As a cyclist, I always suspected that we have behavioral traits that lean toward the masochistic, but clearly DC riders are a very strange lot. Ridership decreases as we would expect from clear weather to mist to rain, but then when the weather gets tough, the tough get going. In a flash of inspiration, though, I recalled that there is only one data point for Heavy Rain, so I recreated the graph this way.
These cyclists aren’t the animals I first suspected. Given these traits, I would venture to guess that a lot of riders just got caught in the rain. Since the weather seems to be a factor of how much precipitation was inflicted on our riders, maybe humidity would be a better indicator.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 47.00 62.00 61.89 77.00 100.00
The number of riders certainly drops off as humidity increases.
DC has so few entries with humidity levels under 20 (0.63% of the data) that, since these outliers skewed the distribution, I felt it was safe to remove them.
This is an interesting graph and I’d like to look into it a little more. I wonder if this affects casual riders or registered riders more.
The slope of the casual line looks much flatter than that of the registered users.
## (Intercept) humidity
## 284.484490 -2.083369
## (Intercept) humidity
## 91.9611188 -0.9038999
Does this lead us to believe that humidity levels affect registered users more than casual ones? That seems counterintuitive. I would suspect that registered riders use the bike system for commuting and would be required to ride into work no matter what the humidity, whereas casual riders would be more spontaneous and therefore more easily persuaded by changes in humidity. The data is showing us an unexpected relationship.
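The coefficients above presumably come from two simple linear fits, e.g. `lm(registered ~ humidity, data = bikeShare)` and the casual equivalent. As a toy check that lm() recovers a slope of this shape, we can fit a noise-free line built from the reported registered-rider coefficients:

```r
# Exact line using the intercept and slope reported above; lm() should
# recover both coefficients essentially exactly.
d <- data.frame(humidity = 0:100)
d$riders <- 284.484490 - 2.083369 * d$humidity
coef(lm(riders ~ humidity, data = d))
```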
Since humidity is something you can’t really see before you head outside, I’m left wondering if windspeed would play a greater role.
The plot thickens. It’s plain to see that there’s a shift in the data when we move from one end of the graph to the other. Since the shift seems to flatten out when the windspeed reaches around 25-30 mph and a linear model looks like a good fit on the left, let’s plot two different linear models on each section.
Riders don’t mind the wind and actually seem to gravitate toward riding when it starts blowing. However, at around 25-30 mph, they start to drop off. It’s much more of a deterrent for registered riders than for the casual group. As the windspeed increases the mean also becomes more sporadic, which may be an indicator that we are seeing fewer entries in that range.
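The two-segment fit described above can be sketched as two separate lm() calls split at an assumed breakpoint of 25 mph (the breakpoint is eyeballed from the plot, not estimated). Toy data with the same rising-then-falling shape:

```r
# Piecewise-linear toy: ridership rises up to the break at 25, then falls.
w <- seq(0, 50, by = 0.5)
riders <- ifelse(w <= 25, 100 + 4 * w, 250 - 2 * w)

# One linear model per segment, split at the assumed breakpoint.
fit_lo <- lm(riders ~ w, subset = w <= 25)
fit_hi <- lm(riders ~ w, subset = w >  25)
c(slope_lo = coef(fit_lo)[["w"]], slope_hi = coef(fit_hi)[["w"]])
```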
As the “Feels Like” temperature increases so do the number of riders…
It looks like it’s starting to level out at around 45C; otherwise, as the temperature goes up we have more riders. There must be a point of diminishing returns, but we need higher temperatures to find it.
As the weather warms the scatter plot turns green, while it turns blue as it cools. We can see the growth of the company altering our results slightly as there are fewer riders in the Spring compared to Winter. Most notable though is the bimodal pattern we see in the data. Since this is hourly we are almost assuredly looking at the impact of commuters. It’s interesting that the spike in commuter traffic is smaller in the mornings than in the evenings, which may indicate that though commuters go into work at different times most leave at around the same time.
After removing the data point for Heavy Rain, we split up the results to review the monthly impact of casual vs. registered riders.
Highlights:
This has me thinking about how the different months can make a person feel like the weather is better or worse as you enter them. I know from personal experience that 15C can feel warm in January and bitterly cold in September. I wonder how riders perceive the actual temperature based on the season.
I expected some of these results, but two things jump out at me. First, I would expect that Winter would contain the coldest temps and that Summer would contain the hottest. However…
## Source: local data frame [4 x 6]
##
## season min mean median max n
## 1 Spring 0.82 12.53049 12.30 29.52 2686
## 2 Winter 5.74 16.64924 16.40 30.34 2734
## 3 Summer 9.84 22.82348 22.96 38.54 2733
## 4 Fall 15.58 28.78911 28.70 41.00 2733
Looking closely, Spring has the lowest temperature (0.82C) and Fall has the highest (41C). Many times data can tell you things you don’t intuitively understand. When we think of Summer and Winter as the extreme seasons, we assume the warmest days fall in Summer and the coldest in Winter. The data from Washington, DC doesn’t support this. If DC were closer to the equator this might make more sense, since seasonal temperature extremes grow as you move toward the poles, but as it stands this is hard to explain…
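The table above looks like dplyr output; a base-R sketch of the same per-season aggregation on toy data (column names assumed from the printed output):

```r
# min/median/max of temp per season; aggregate() returns one row per season
# with a matrix column holding the named summary statistics.
toy <- data.frame(season = rep(c("Spring", "Fall"), each = 3),
                  temp   = c(0.82, 12.30, 29.52, 15.58, 28.70, 41.00))
aggregate(temp ~ season, data = toy,
          FUN = function(x) c(min = min(x), median = median(x), max = max(x)))
```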
Maybe the “Feels Like” temperature is different?
## Source: local data frame [4 x 6]
##
## season min mean median max n
## 1 Spring 0.760 15.22896 14.395 32.575 2686
## 2 Winter 7.575 20.05991 20.455 34.090 2734
## 3 Summer 11.365 26.64710 26.515 43.940 2733
## 4 Fall 12.120 32.54078 33.335 45.455 2733
I concede. Spring and Fall are the most extreme seasons, whereas Summer and Winter are comparatively mild.
editors note: This could be due to all the hot air being blown around by the enormous influx of politicians in the area - more research is needed.
These density plots confirm what we’ve seen earlier with regard to the number of riders during different types of weather. Cyclists are more likely to ride in inclement weather during the Spring or Summer than in the Fall or Winter. Even though, as we determined earlier, Fall is the warmest season and Spring the coldest.
## [1] "Spring - Rain density probability: 13.57%"
## [1] "Percentage of Entries that had Rain in the Spring: 7.86%"
## [1] "Fall - Rain density probability: 9.48%"
## [1] "Percentage of Entries that had Rain in the Fall: 7.28%"
## [1] "Summer - Rain density probability: 12.88%"
## [1] "Percentage of Entries that had Rain in the Summer: 8.2%"
## [1] "Winter - Rain density probability: 7.54%"
## [1] "Percentage of Entries that had Rain in the Winter: 8.23%"
The above just quantifies this a bit.
I broke the linear models up into casual and registered users as separate entities and trained for each using the following formulas:
Casual returned an R-squared value of 0.461 and Registered only received 0.31, so we assume it’s not going to be a great fit.
The results were mostly poor, so I opted for a decision tree instead (rpart)…
## Loading required package: rpart
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
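The warnings above come from caret's training loop. A minimal rpart sketch on toy hourly data illustrates the kind of fit involved (the report's actual call used caret with method = "rpart"; the formula here is an assumption, not the author's code):

```r
# Toy hourly data with commute-hour spikes; rpart splits on hour and should
# predict far higher ridership at 8am than at 3am.
library(rpart)
set.seed(42)
toy <- data.frame(hour = rep(0:23, 20))
toy$count <- ifelse(toy$hour %in% c(8, 17, 18), 400, 50) +
             rnorm(nrow(toy), 0, 5)
tree <- rpart(count ~ hour, data = toy)
predict(tree, data.frame(hour = c(3, 8)))
```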
Here’s the original training data:
and the test data after it’s been trained:
Both shapes should look similar; since they don’t, we don’t have a very good algorithm yet. We should go back to looking at ways to manipulate the data to make it more linear, or use a different algorithm altogether, such as RandomForest, for this data set.
Looking at the top graph we see a red bimodal line on all days. It’s more extreme as we look at just working days (blue line). This changes when we look at the holidays only, as there is a smooth line that simply rises and drops throughout the day.
The heavy black line is the mean of all riders across all the data.
In the second image, we break this out to just working days and further into casual and registered riders. Here it’s more obvious that registered riders are in fact, commuting. Commute times peak shortly before 8am and again near 6pm. There is a slight rise around 1pm which I suspect are those using the bikes for lunch. In contrast, we don’t see any of this behavior in casual riders.
Though casual riders bike more on holidays, registered users bike less and their peaks are much less defined. This yields more evidence that registered riders are using these bikes for commuting.
Notice in the lower right that as the temperature goes up it reaches a point where there are more and more riders. Eventually there’s a point (around 33C) where ridership really takes off and there are never fewer than 75 riders. I suspect that this is because these temperatures only ever occur during the day. It may be interesting that Mist and Rain seem to hover around certain temperatures; at the extreme outliers, for instance, we always see clear weather. However, you might recall that at the start of this project we saw that there were significantly more clear hours, so I won’t read too much into that.
Intuitively we understand that as humidity increases ridership should drop regardless of the temperature. Though the data supports this view the really interesting thing is how much riders avoid the humidity based on season. In the above, we can see that in the Spring people really don’t mind the humidity (the mean line is basically flat) whereas the steep line in the Fall shows how much these changes in humidity cause people to avoid cycling.
The Bicycle Sharing dataset contains over 10,000 records recorded hourly from 2011-2012. Beginning with the total riders (count) and moving on to the individual columns, we were able to develop questions that were answered as we crawled through the dataset. Eventually we were able to build a picture of the dataset and observe how it was constructed and how it related to ridership.
I was surprised to learn that temperature extremes actually occur in the Spring and Fall seasons in Washington DC from 2011-2012…
This is contrary to Wikipedia [http://en.wikipedia.org/wiki/Season]
“Meteorological seasons are reckoned by temperature, with summer being the hottest quarter of the year and winter the coldest quarter of the year.”
…and my intuition.
The growth rate, the seasons, the registered riders all play a significant role in the final counts. However the temperature, weather, windspeed and humidity have very little impact until they are paired with specific months.
The test data has been pulled from within the two years, so you should treat this data as missing portions of an already intact dataset. Further, we see discrepancies between working days and holidays that should be figured into our final prediction model. Working days and holidays are boolean values, and since we have no entries where both are TRUE (which would be strange), we have to assume that if both are FALSE it’s probably a weekend. Given that, here’s the algorithm:
To predict a given hour we look backwards through the training data from the point we are predicting, for the same workingday, holiday and hour values. At the first point we come to, we record the count and day. We then perform the same search in the forward direction.
We will then weight these two points linearly from past to future and record the interpolated value as it passes through the test point we are looking for. In this way we incorporate the most relevant parts of the dataset.
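The scheme above can be sketched as a small helper (a hypothetical function, not the author's code; it assumes the training frame is sorted by datetime and uses the column names from the str() output at the top of the section):

```r
# Find the nearest matching rows before and after the target time, then
# weight their counts linearly by time distance to the target.
predict_count <- function(train, when, workingday, holiday, hour) {
  m <- train[train$workingday == workingday &
             train$holiday    == holiday &
             train$hour       == hour, ]
  before <- m[m$datetime <  when, ]
  after  <- m[m$datetime >= when, ]
  if (nrow(before) == 0) return(after$count[1])
  if (nrow(after)  == 0) return(before$count[nrow(before)])
  b <- before[nrow(before), ]
  a <- after[1, ]
  frac <- as.numeric(difftime(when, b$datetime, units = "hours")) /
          as.numeric(difftime(a$datetime, b$datetime, units = "hours"))
  (1 - frac) * b$count + frac * a$count
}

# Toy check: halfway between counts of 100 and 200 interpolates to 150.
tr <- data.frame(
  datetime   = as.POSIXct(c("2011-01-01 08:00", "2011-01-03 08:00"),
                          tz = "UTC"),
  workingday = 1, holiday = 0, hour = 8, count = c(100, 200))
predict_count(tr, as.POSIXct("2011-01-02 08:00", tz = "UTC"), 1, 0, 8)
```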
Here are the results:
and the original dataset
I’m happy with these results.